%%HTML
<script src="require.js"></script>
Project Overview:¶
This project involves a comprehensive analysis of loan data, specifically focusing on data from ID/X Partners. The dataset contains various features related to loans, including loan amounts, interest rates, borrower information, and loan status. The goal is to gain insights into the factors influencing loan performance, visualize trends over time, and make informed suggestions for optimizing business strategies.
Import necessary libraries¶
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default='notebook'
Data preprocessing¶
# Load the dataframe
loan_data = pd.read_csv(r'/workspaces/idx_finalproject/resources/loan_data_2007_2014.csv', low_memory=False)
# Print dataframe head row
loan_data.head()
| Unnamed: 0 | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | ... | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1077501 | 1296599 | 5000 | 5000 | 4975.0 | 36 months | 10.65 | 162.87 | B | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 1077430 | 1314167 | 2500 | 2500 | 2500.0 | 60 months | 15.27 | 59.83 | C | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | 1077175 | 1313524 | 2400 | 2400 | 2400.0 | 36 months | 15.96 | 84.33 | C | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 3 | 1076863 | 1277178 | 10000 | 10000 | 10000.0 | 36 months | 13.49 | 339.31 | C | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4 | 1075358 | 1311748 | 3000 | 3000 | 3000.0 | 60 months | 12.69 | 67.79 | B | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 75 columns
# Display dataframe info
loan_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 466285 entries, 0 to 466284 Data columns (total 75 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 466285 non-null int64 1 id 466285 non-null int64 2 member_id 466285 non-null int64 3 loan_amnt 466285 non-null int64 4 funded_amnt 466285 non-null int64 5 funded_amnt_inv 466285 non-null float64 6 term 466285 non-null object 7 int_rate 466285 non-null float64 8 installment 466285 non-null float64 9 grade 466285 non-null object 10 sub_grade 466285 non-null object 11 emp_title 438697 non-null object 12 emp_length 445277 non-null object 13 home_ownership 466285 non-null object 14 annual_inc 466281 non-null float64 15 verification_status 466285 non-null object 16 issue_d 466285 non-null object 17 loan_status 466285 non-null object 18 pymnt_plan 466285 non-null object 19 url 466285 non-null object 20 desc 125981 non-null object 21 purpose 466285 non-null object 22 title 466264 non-null object 23 zip_code 466285 non-null object 24 addr_state 466285 non-null object 25 dti 466285 non-null float64 26 delinq_2yrs 466256 non-null float64 27 earliest_cr_line 466256 non-null object 28 inq_last_6mths 466256 non-null float64 29 mths_since_last_delinq 215934 non-null float64 30 mths_since_last_record 62638 non-null float64 31 open_acc 466256 non-null float64 32 pub_rec 466256 non-null float64 33 revol_bal 466285 non-null int64 34 revol_util 465945 non-null float64 35 total_acc 466256 non-null float64 36 initial_list_status 466285 non-null object 37 out_prncp 466285 non-null float64 38 out_prncp_inv 466285 non-null float64 39 total_pymnt 466285 non-null float64 40 total_pymnt_inv 466285 non-null float64 41 total_rec_prncp 466285 non-null float64 42 total_rec_int 466285 non-null float64 43 total_rec_late_fee 466285 non-null float64 44 recoveries 466285 non-null float64 45 collection_recovery_fee 466285 non-null float64 46 last_pymnt_d 465909 non-null object 47 last_pymnt_amnt 466285 non-null float64 48 next_pymnt_d 239071 non-null object 49 last_credit_pull_d 466243 non-null object 50 collections_12_mths_ex_med 466140 non-null float64 51 mths_since_last_major_derog 98974 non-null float64 52 policy_code 466285 non-null int64 53 application_type 466285 non-null object 54 annual_inc_joint 0 non-null float64 55 dti_joint 0 non-null float64 56 verification_status_joint 0 non-null float64 57 acc_now_delinq 466256 non-null float64 58 tot_coll_amt 396009 non-null float64 59 tot_cur_bal 396009 non-null float64 60 open_acc_6m 0 non-null float64 61 open_il_6m 0 non-null float64 62 open_il_12m 0 non-null float64 63 open_il_24m 0 non-null float64 64 mths_since_rcnt_il 0 non-null float64 65 total_bal_il 0 non-null float64 66 il_util 0 non-null float64 67 open_rv_12m 0 non-null float64 68 open_rv_24m 0 non-null float64 69 max_bal_bc 0 non-null float64 70 all_util 0 non-null float64 71 total_rev_hi_lim 396009 non-null float64 72 inq_fi 0 non-null float64 73 total_cu_tl 0 non-null float64 74 inq_last_12m 0 non-null float64 dtypes: float64(46), int64(7), object(22) memory usage: 266.8+ MB
# Drop irrelevant columns
loan_data.drop(['Unnamed: 0', 'url'], axis=1, inplace=True)
# Convert date-value columns to datetime with explicit date format
loan_data['issue_d'] = pd.to_datetime(loan_data['issue_d'], format='%b-%y')
loan_data['next_pymnt_d'] = pd.to_datetime(loan_data['next_pymnt_d'], format='%b-%y')
loan_data['last_pymnt_d'] = pd.to_datetime(loan_data['last_pymnt_d'], format='%b-%y')
loan_data['last_credit_pull_d'] = pd.to_datetime(loan_data['last_credit_pull_d'], format='%b-%y')
# Print datetime columns
print(loan_data[['issue_d', 'next_pymnt_d', 'last_pymnt_d', 'last_credit_pull_d']])
issue_d next_pymnt_d last_pymnt_d last_credit_pull_d 0 2011-12-01 NaT 2015-01-01 2016-01-01 1 2011-12-01 NaT 2013-04-01 2013-09-01 2 2011-12-01 NaT 2014-06-01 2016-01-01 3 2011-12-01 NaT 2015-01-01 2015-01-01 4 2011-12-01 2016-02-01 2016-01-01 2016-01-01 ... ... ... ... ... 466280 2014-01-01 2016-02-01 2016-01-01 2016-01-01 466281 2014-01-01 NaT 2014-12-01 2016-01-01 466282 2014-01-01 2016-02-01 2016-01-01 2015-12-01 466283 2014-01-01 NaT 2014-12-01 2015-04-01 466284 2014-01-01 2016-02-01 2016-01-01 2016-01-01 [466285 rows x 4 columns]
# Handling missing data
# Identify numerical and categorical columns
numerical_columns = loan_data.select_dtypes(include=['float64', 'int64']).columns
categorical_columns = loan_data.select_dtypes(include=['object']).columns
# Fill missing values in numerical columns with the mean
loan_data[numerical_columns] = loan_data[numerical_columns].fillna(loan_data[numerical_columns].mean())
# Fill missing values in categorical columns with "Not specified"
loan_data[categorical_columns] = loan_data[categorical_columns].fillna("Not specified")
Initial findings¶
# Initial findings
# Select columns to find initial insights
use_cols = ['loan_amnt', 'funded_amnt', 'int_rate', 'grade',
'annual_inc', 'dti', 'open_acc', 'pub_rec',
'revol_bal', 'total_acc', 'loan_status', 'int_rate',
'issue_d', 'next_pymnt_d', 'last_pymnt_d']
df = loan_data[use_cols]
# Visualize loan grade distribution
plt.figure(figsize=(10, 6))
grades_order = sorted(df['grade'].unique())
grade_counts = df['grade'].value_counts()[grades_order]
plt.bar(grades_order, grade_counts)
plt.title('Distribution of Loan Grades')
plt.xlabel('Loan Grade')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
# Visualize the distribution of issue date based on loan status
plt.figure(figsize=(10, 6))
loan_status_values = df['loan_status'].unique()
for status in loan_status_values:
plt.hist(df[df['loan_status'] == status]['issue_d'], bins=30, alpha=0.5, label=status)
plt.title('Distribution of Loan Issue Dates by Loan Status')
plt.xlabel('Issue Date')
plt.ylabel('Number of Loans')
plt.legend()
plt.tight_layout()
plt.show()
# Visualize the distribution of loan amounts based on loan status
plt.figure(figsize=(10, 6))
loan_status_values = df['loan_status'].unique()
for status in loan_status_values:
plt.hist(df[df['loan_status'] == status]['loan_amnt'], bins=30, alpha=0.5, label=status)
plt.title('Distribution of Loan Amounts by Loan Status')
plt.xlabel('Loan Amount')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
# Visualize the distribution of interest rates based on loan status
plt.figure(figsize=(10, 6))
for status in loan_status_values:
plt.hist(df[df['loan_status'] == status]['int_rate'], bins=20, alpha=0.5, label=status)
plt.title('Distribution of Interest Rates by Loan Status')
plt.xlabel('Interest Rate')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
# Explore the relationship between annual income and debt-to-income ratio
plt.figure(figsize=(10, 6))
for status in loan_status_values:
plt.scatter(df[df['loan_status'] == status]['annual_inc'],
df[df['loan_status'] == status]['dti'],
label=status, alpha=0.7)
plt.title('Relationship between Annual Income, DTI, and Loan Status')
plt.xlabel('Annual Income')
plt.ylabel('Debt-to-Income Ratio')
plt.legend()
plt.tight_layout()
plt.show()
Feature engineering¶
# Feature engineering
# Loan amount to income ratio
loan_data['loan_inc_ratio'] = loan_data['loan_amnt'] / loan_data['annual_inc']
# Distribution plot
plt.figure(figsize=(10, 6))
loan_data['loan_inc_ratio'].hist(bins=20)
plt.xlabel('Loan to Income Ratio')
plt.title('Distribution of Loan to Income Ratio')
plt.tight_layout()
plt.show()
# Extract year and month from date columns
loan_data['issue_year'] = loan_data['issue_d'].dt.year
loan_data['issue_month'] = loan_data['issue_d'].dt.month
loan_data['next_pymnt_year'] = loan_data['next_pymnt_d'].dt.year
loan_data['next_pymnt_month'] = loan_data['next_pymnt_d'].dt.month
loan_data['last_pymnt_year'] = loan_data['last_pymnt_d'].dt.year
loan_data['last_pymnt_month'] = loan_data['last_pymnt_d'].dt.month
loan_data['last_credit_pull_year'] = loan_data['last_credit_pull_d'].dt.year
loan_data['last_credit_pull_month'] = loan_data['last_credit_pull_d'].dt.month
# Extracting loan term from 'term' column
loan_data['loan_term'] = loan_data['term'].apply(lambda x: int(x.split()[0]))
# Calculate the average interest rate
avg_interest_rate = loan_data['int_rate'].mean()
# Create a new feature for interest rate difference
loan_data['int_rate_diff'] = loan_data['int_rate'] - avg_interest_rate
# Calculate debt-to-income ratio
loan_data['dti_ratio'] = loan_data['dti'] / loan_data['annual_inc']
# Calculate credit utilization ratio
loan_data['credit_util_ratio'] = loan_data['revol_bal'] / loan_data['total_rev_hi_lim']
# Create a binary indicator for loan status (1 for 'Charged Off', 0 for others)
loan_data['loan_status_binary'] = (loan_data['loan_status'] == 'Charged Off').astype(int)
# Analyze loan issuance trends over time
# Group data by issue year and month
issue_date_counts = loan_data.groupby(['issue_year', 'issue_month']).size().unstack()
# Calculate default rates by issue year and month
default_rates_by_date = loan_data.groupby(['issue_year', 'issue_month'])['loan_status_binary'].mean().unstack()
# Plot loan issuance trends and default rates side by side horizontally
fig, axes = plt.subplots(1, 2, figsize=(18, 8))
# Plot loan issuance trends
sns.heatmap(issue_date_counts, cmap='Blues', annot=True, fmt='g', linewidths=.5, ax=axes[0])
axes[0].set_title('Loan Issuance Trends Over Time')
axes[0].set_xlabel('Issue Month')
axes[0].set_ylabel('Issue Year')
# Plot default rates over time
sns.heatmap(default_rates_by_date, cmap='Reds', annot=True, fmt='.2%', ax=axes[1])
axes[1].set_title('Default Rates Over Time')
axes[1].set_xlabel('Issue Month')
axes[1].set_ylabel('Issue Year')
plt.tight_layout()
plt.show()
The most noticeable insight is that in 2014, there was a significant increase in loan issuances, particularly in October, where over 38,000 loans were issued. This suggests a surge in lending activity during that specific month and year. possibly indicating a strategic move by the lending institution to boost lending during that period. Interestingly, the heatmap shows that the default rates were highest in the years 2011 and 2012, suggesting that loans issued during these periods were more likely to default. The variation in default rates across different years highlights potential patterns or trends that may influence credit risk and loan performance.
# Get the distribution of loan terms
term_counts = loan_data['loan_term'].value_counts()
# Plot the distribution of loan terms
plt.figure(figsize=(12, 5))
# Plot loan terms
plt.subplot(1, 2, 1)
plt.bar(term_counts.index.astype(str), term_counts, color='skyblue')
plt.title('Distribution of Loan Terms')
plt.xlabel('Loan Term (Months)')
plt.ylabel('Number of Loans')
# Plot a pie chart for loan terms
plt.subplot(1, 2, 2)
plt.pie(term_counts, labels=term_counts.index.astype(str), autopct='%1.1f%%', colors=['lightcoral', 'lightskyblue'])
plt.title('Loan Term Distribution')
plt.axis('equal')
plt.tight_layout()
plt.show()
The bar chart and pie chart collectively reveal that the majority of loans in the dataset have a 36-month term, constituting a substantial portion of the loan portfolio. Conversely, the 60-month term, while still popular, represents a smaller proportion of the total loans.
# Compare individual loan interest rates to the average interest rate
# Plot the distribution of individual loan interest rate differences
plt.figure(figsize=(10, 6))
plt.hist(loan_data['int_rate_diff'], bins=30, alpha=0.5, color='skyblue', label='Interest Rate Differences')
plt.axvline(0, color='red', linestyle='dashed', linewidth=2, label='Average Interest Rate')
plt.title('Distribution of Interest Rate Differences from Average')
plt.xlabel('Interest Rate Difference')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
Correlation analysis¶
# Select relevant columns for correlation analysis
selected_columns_with_engineered = ['loan_amnt', 'funded_amnt', 'int_rate', 'annual_inc',
'dti', 'open_acc', 'pub_rec', 'revol_bal', 'total_acc',
'loan_status_binary', 'issue_year', 'issue_month',
'loan_term', 'int_rate_diff', 'dti_ratio', 'credit_util_ratio']
# Calculate and visualize correlation matrix
correlation_matrix_with_engineered = loan_data[selected_columns_with_engineered].corr()
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix_with_engineered, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix with Engineered Features')
plt.tight_layout()
plt.show()
- Strong positive correlations:
- loan_amnt and funded_amnt (0.99): This is to be expected, as the funded amount is typically a portion of the loan amount requested.
- annual_inc and loan_amnt (0.37): Borrowers with higher incomes tend to take out larger loans.
- total_acc and open_acc (0.67): Borrowers with more open accounts also tend to have more total accounts.
- loan_term and int_rate_diff (0.44): Borrowers with longer loan terms tend to have higher interest rate differences compared to the average.
- Strong negative correlations:
- dti and annual_inc (-0.67): Borrowers with higher debt-to-income ratios tend to have lower incomes.
- dti and credit_util_ratio (-0.40): Borrowers with higher debt-to-income ratios also tend to have lower credit utilization ratios.
- int_rate and dti_ratio (-0.29): Borrowers with higher debt-to-income ratios tend to have lower interest rates.
- Weak correlations:
- loan_status_binary with most other features: This suggests that loan status is not strongly related to most of the other features in the dataset.
- issue_month with most other features: This suggests that the month in which the loan was issued is not strongly related to most of the other features in the dataset.
# Group by 'addr_state' and count the number of loans
loan_states = loan_data.groupby('addr_state')['id'].count().reset_index(name='loans')
# Plot the bar chart
fig, ax = plt.subplots(figsize=(10, 6))
top_10 = loan_states.nlargest(10, 'loans')
top_10.plot.bar(x='addr_state', y='loans', rot=0, ax=ax)
plt.title("Top 10 States by Number of Loans")
plt.xlabel("State")
plt.ylabel("Number of Loans")
plt.tight_layout()
plt.show()
# Plot the choropleth map
fig = px.choropleth(loan_states,
locations='addr_state',
locationmode='USA-states',
color='loans',
scope='usa',
title='Number of Loans by State',
color_continuous_scale='blues')
# Center and adjust layout:
fig.update_layout(
margin=dict(l=0, r=0, t=50, b=0),
geo=dict(
showframe=False,
projection_scale=1
)
)
fig.show()
The map reveals that California (CA) has the highest number of loans, with over 70,000 loans. New York (NY) follows with around 40,000 loans, and Texas (TX) is close behind with approximately 36,000 loans.
- Focus on 36-Month Terms:
The majority of loans in the dataset have a 36-month term. Consider continuing to offer and promote these shorter-term loans, as they are more popular among borrowers.
- Strategic Issuance in High-Performing Months:
Analyze the factors contributing to the surge in loan issuances in specific months, such as the notable increase in October 2014. Consider replicating successful strategies during those high-performing months.
- Manage Default Rates:
Address the higher default rates observed in 2011 and 2012. Evaluate the underwriting criteria during those periods and identify any patterns or risk factors contributing to increased default rates. Implement strategies to mitigate default risks.
- Diversify Marketing Strategies:
Explore and diversify marketing strategies to attract a broader customer base. Understand the preferences and characteristics of borrowers with successful repayment histories and target similar demographics.
- Optimize Interest Rates:
Analyze the distribution of interest rates and their impact on loan performance. Consider optimizing interest rates to remain competitive while ensuring profitability and managing default risks.
- Enhance Geographic Presence:
Assess the geographic distribution of loans and identify regions with high demand. Consider expanding or refining the company's presence in those areas to tap into potential markets.
- Customer Education and Transparency:
Enhance customer education regarding loan terms, interest rates, and repayment processes. Transparency in communication can build trust and improve customer satisfaction.
- Adapt to Economic Conditions:
Stay vigilant to economic conditions that may impact borrower repayment capabilities. Develop adaptive strategies to navigate economic fluctuations and minimize the impact on loan performance.
- Continuous Monitoring and Adaptation:
Establish a robust monitoring system to track key performance indicators over time. Regularly revisit and adapt business strategies based on evolving market conditions and customer behaviors.
- Optimize Current Loan Portfolio:
Since a significant portion of loans is labeled as "Current," focus on optimizing the management and servicing of these loans. Implement strategies to minimize the risk of loans transitioning into delinquency.
- Customer Engagement for Fully Paid Loans:
For loans labeled as "Fully Paid," leverage this positive outcome to engage with customers. Consider soliciting feedback, offering loyalty incentives for repeat business, and encouraging satisfied customers to refer others.
- Mitigate Risk for Delinquent Loans:
Given the lower frequencies in other loan statuses, pay special attention to loans with delinquent statuses. Implement proactive measures such as targeted communication, flexible repayment plans, or refinancing options to mitigate the risk of default.
- Enhance Communication Strategies:
Improve communication strategies for borrowers across different loan statuses. Provide clear and transparent communication regarding payment schedules, outstanding balances, and any available support or resources in case of financial difficulties.
- Customized Marketing for Different Loan Statuses:
Tailor marketing efforts based on the loan status. For instance, develop campaigns to encourage borrowers with fully paid loans to consider additional financial products or services. For borrowers with delinquent status, provide resources to help them get back on track.
- Implement Predictive Analytics:
Utilize predictive analytics to forecast the likelihood of loans transitioning into different statuses. This can aid in early intervention strategies for loans showing signs of potential issues.
- Customer Retention Programs:
Design customer retention programs for borrowers with fully paid loans to foster long-term relationships. Offer exclusive benefits, promotions, or loyalty programs to incentivize them to choose the company for future financial needs.
- Diversify Loan Products:
Explore diversification of loan products to attract a wider range of customers. Different loan products may appeal to various segments of the market, potentially reducing the concentration of loans in the "Current" status.